An empirical analysis of information filtering methods
The growth in the number of news articles, blogs, images, and videos
available on the Web is making it more challenging for people to find potentially useful information. People have relied on search engines to satisfy their short-term needs, such as finding the telephone number for a restaurant; however, these systems have not been designed to support long-term needs, such as the research interests of academics. One approach to supporting long-term needs is to use an Information Filtering system to select potentially useful information from the vast amount being produced every day.
The similarities between Information Retrieval systems and Information
Filtering systems are well established. They have prompted the use of retrieval models and methods in filtering systems, which has had some success but has been criticised as a limiting factor due to the unique challenges of document filtering. A significant difference between these systems is the use case: a filtering system is intended to push information to the user over a period of time, whereas a retrieval system is intended for the user to pull information to themselves for immediate use. The main challenge that needs to be addressed by a filtering system is the transient nature of the information published on the Web and the drifting nature of information needs. These factors lead to an uncertain interplay between the components comprising a filtering system, and this thesis presents an empirical analysis of how the main system components affect performance.
The analysis explores the role of each system component independently and in conjunction with other components. The main contribution of this thesis is a deeper understanding of how different components affect performance and of the interplay between these components. The thesis is intended to act as a guide for both practitioners and researchers interested in overcoming some of the challenges of building filtering systems.
The Role of Syntactic Planning in Compositional Image Captioning
Image captioning has focused on generalizing to images drawn from the same
distribution as the training set, and not to the more challenging problem of
generalizing to different distributions of images. Recently, Nikolaus et al.
(2019) introduced a dataset to assess compositional generalization in image
captioning, where models are evaluated on their ability to describe images with
unseen adjective-noun and noun-verb compositions. In this work, we investigate
different methods to improve compositional generalization by planning the
syntactic structure of a caption. Our experiments show that jointly modeling
tokens and syntactic tags enhances generalization in both RNN- and
Transformer-based models, while also improving performance on standard metrics.
Comment: Accepted at EACL 202
Lessons learned in multilingual grounded language learning
Recent work has shown how to learn better visual-semantic embeddings by
leveraging image descriptions in more than one language. Here, we investigate
in detail which conditions affect the performance of this type of grounded
language learning model. We show that multilingual training improves over
bilingual training, and that low-resource languages benefit from training with
higher-resource languages. We demonstrate that a multilingual model can be
trained equally well on either translations or comparable sentence pairs, and
that annotating the same set of images in multiple languages enables further
improvements via an additional caption-caption ranking objective.
Comment: CoNLL 201
The Sensitivity of Language Models and Humans to Winograd Schema Perturbations
Large-scale pretrained language models are the major driving force behind
recent improvements in performance on the Winograd Schema Challenge, a widely
employed test of common sense reasoning ability. We show, however, with a new
diagnostic dataset, that these models are sensitive to linguistic perturbations
of the Winograd examples that minimally affect human understanding. Our results
highlight interesting differences between humans and language models: language
models are more sensitive to number or gender alternations and synonym
replacements than humans, and humans are more stable and consistent in their
predictions, maintain a much higher absolute performance, and perform better on
non-associative instances than associative ones. Overall, humans are correct
more often than out-of-the-box models, and the models are sometimes right for
the wrong reasons. Finally, we show that fine-tuning on a large, task-specific
dataset can offer a solution to these issues.
Comment: ACL 202
Retrieval-augmented Image Captioning
Inspired by retrieval-augmented language generation and pretrained Vision and
Language (V&L) encoders, we present a new approach to image captioning that
generates sentences given the input image and a set of captions retrieved from
a datastore, as opposed to the image alone. The encoder in our model jointly
processes the image and retrieved captions using a pretrained V&L BERT, while
the decoder attends to the multimodal encoder representations, benefiting from
the extra textual evidence from the retrieved captions. Experimental results on
the COCO dataset show that image captioning can be effectively formulated from
this new perspective. Our model, named EXTRA, benefits from using captions
retrieved from the training dataset, and it can also benefit from using an
external dataset without the need for retraining. Ablation studies show that
retrieving a sufficient number of captions (e.g., k=5) can improve captioning
quality. Our work contributes towards using pretrained V&L encoders for
generative tasks, instead of standard classification tasks.
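The retrieval step described in this abstract can be illustrated with a minimal sketch. It assumes captions are stored alongside precomputed embedding vectors and are ranked by cosine similarity to a query vector; the names `cosine` and `retrieve_captions`, the similarity measure, and the datastore layout are hypothetical and are not taken from the EXTRA model itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_captions(query_vec, datastore, k=5):
    """Return the k captions whose vectors are most similar to query_vec.

    datastore is a list of (caption, vector) pairs; this layout is an
    assumption made for illustration only.
    """
    ranked = sorted(datastore, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [caption for caption, _ in ranked[:k]]


# Toy datastore with hand-written 2-d vectors (hypothetical data).
toy_datastore = [
    ("a dog runs", [1.0, 0.0]),
    ("a cat sleeps", [0.0, 1.0]),
    ("a dog jumps", [0.9, 0.1]),
]
top_two = retrieve_captions([1.0, 0.0], toy_datastore, k=2)
```

In a setup like the one the abstract describes, the retrieved captions would then be passed to the encoder together with the image; that part is outside the scope of this sketch.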
Towards Succinct and Relevant Image Descriptions
What does it mean to produce a good description of an image? Is a description good because it correctly identifies all of the objects in the image, because it describes the interesting attributes of the objects, or because it is short, yet informative? Grice's Cooperative Principle, stated as "Make your contribution such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged" (Grice, 1975), alongside other ideas of pragmatics in communication, has proven useful in thinking about language generation (Hovy, 1987; McKeown et al., 1995). The Cooperative Principle provides one possible framework for thinking about the generation and evaluation of image descriptions. The immediate question is whether automatic image description is within the scope of the Cooperative Principle. Consider the task of searching for images using natural language, where the purpose of the exchange is for the user to quickly and accurately find images that match their information needs. In this scenario, the user formulates a complete sentence query to express their needs, e.g. A sheepdog chasing sheep in a field, and initiates an exchange with the system in the form of a sequence of one-shot conversations. In this exchange, both participants can describe images in natural language, and a successful outcome relies on each participant succinctly and correctly expressing their beliefs about the images. I
Structured representation of images for language generation and image retrieval
A photograph typically depicts an aspect of the real world, such as an
outdoor landscape, a portrait, or an event. The task of creating abstract
digital representations of images has received a great deal of attention in
the computer vision literature because it is rarely useful to work directly
with the raw pixel data. The challenge of working with raw pixel data
is that small changes in lighting can result in different digital images,
which is not typically useful for downstream tasks such as object detection.
One approach to representing an image is automatically extracting and
quantising visual features to create a bag-of-terms vector. The bag-of-terms
vector helps overcome the problems with raw pixel data but this
unstructured representation discards potentially useful information about
the spatial and semantic relationships between the parts of the image.
The central argument of this thesis is that capturing and encoding the
relationships between parts of an image will improve the performance of
extrinsic tasks, such as image description or search. We explore this claim
in the restricted domain of images representing events, such as riding a
bicycle or using a computer.
The first major contribution of this thesis is the Visual Dependency Representation:
a novel structured representation that captures the prominent
region–region relationships in an image. The key idea is that images depicting
the same events are likely to have similar spatial relationships
between the regions contributing to the event. This representation is inspired
by dependency syntax for natural language, which directly captures
the relationships between the words in a sentence. We also contribute
a data set of images annotated with multiple human-written descriptions,
labelled image regions, and gold-standard Visual Dependency
Representations, and explain how the gold-standard representations can
be constructed by trained human annotators.
The second major contribution of this thesis is an approach to automatically
predicting Visual Dependency Representations using a graph-based
statistical dependency parser. A dependency parser is typically used in
Natural Language Processing to automatically predict the dependency
structure of a sentence. In this thesis we use a dependency parser to
predict the Visual Dependency Representation of an image because we
are working with a discrete image representation: that of image regions.
Our approach can exploit features from the region annotations and the
description to predict the relationships between objects in an image. In a
series of experiments using gold-standard region annotations, we report
significant improvements in labelled and unlabelled directed attachment
accuracy over a baseline that assumes there are no relationships between
objects in an image.
Finally, we find significant improvements in two extrinsic tasks when we
represent images as Visual Dependency Representations predicted from
gold-standard region annotations. In an image description task, we show
significant improvements in automatic evaluation measures and human
judgements compared to state-of-the-art models that use either external
text corpora or region proximity to guide the generation process. In the
query-by-example image retrieval task, we show a significant improvement
in Mean Average Precision and the precision of the top 10 images
compared to a bag-of-terms approach. We also perform a correlation
analysis of human judgements against automatic evaluation measures for
the image description task. The automatic measures are standard measures
adopted from the machine translation and summarization literature.
The main finding of the analysis is that unigram BLEU is less correlated
with human judgements than Smoothed BLEU, Meteor, or skip-bigram
ROUGE.
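For readers unfamiliar with the retrieval measures named in this abstract, the following minimal sketch computes precision at k and Mean Average Precision using their standard definitions. The function names and the toy data are hypothetical; the thesis's exact evaluation protocol is not given in the abstract.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked items that are relevant.

    Divides by k even when fewer than k items were returned.
    """
    return sum(1 for item in ranked[:k] if item in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears,
    normalised by the total number of relevant items."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries (hypothetical image identifiers).
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_score = mean_average_precision(runs)
```

For the first toy query, relevant images appear at ranks 1 and 3, so its average precision is (1/1 + 2/3) / 2.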
- …